Gpu known subgroup size #4

FMarno · 2024-10-17T15:45:48Z

No description provided.

This manifested as an assertion failure in Clang built against libc++ with hardening enabled (e.g. -D_LIBCPP_HARDENING_MODE=_LIBCPP_HARDENING_MODE_DEBUG): `libcxx/include/__memory/unique_ptr.h:596: assertion __checker_.__in_bounds(std::__to_address(__ptr_), __i) failed: unique_ptr<T[]>::operator[](index): index out of range`

After 7f74651, the pointer operand of a PtrAdd may be replicated. Instead of requesting a single scalar, request lane 0, which correctly handles the case when there is a scalar per lane. Fixes llvm#111606.

Prep work for llvm#110875

This commit adds the ViewLikeOpInterface to the GEP and AddrSpaceCast operations. This allows us to simplify the inliner interface. At the same time, the change also makes the inliner interface more extensible for downstream users that have custom view-like operations.

…aders for LinalgDialect (llvm#111603) This fixes non-deterministic build failures. Fixes llvm#111527 --------- Co-authored-by: zecheng.zhang <[email protected]> Co-authored-by: Mehdi Amini <[email protected]>

llvm#111451) We add static methods to APFloatBase to allow the hasZero and hasSignedRepr properties of fltSemantics to be obtained.

) On Apple platforms, using system-libcxxabi as an ABI library wouldn't work because we'd try to re-export symbols from libc++abi that the system libc++abi.dylib might not have. Instead, only re-export those symbols when we're using the in-tree libc++abi. This does mean that libc++.dylib won't re-export any libc++abi symbols when building against the system libc++abi, which could be fixed in various ways. However, the best solution really depends on the intended use case, so this patch doesn't try to solve that problem. As a drive-by, also improve the diagnostic message when the user forgets to set the LIBCXX_CXX_ABI_INCLUDE_PATHS variable, which would previously lead to a confusing error. Closes llvm#104672

…11533) This patch implements speculation for vector.transfer_read/vector.transfer_write ops, allowing these ops to work with LICM.

Add the permutation clause for the interchange directive, which will be introduced in the upcoming OpenMP 6.0 specification. A preview has been published in [Technical Report 12](https://www.openmp.org/wp-content/uploads/openmp-TR12.pdf).

…ls. (llvm#109708) Make legacy cost retrieval independent of getInstructionForCost by sinking it to more specific ::computeCost implementation (specifically VPInterleaveRecipe::computeCost and VPSingleDefRecipe::computeCost). Inline getInstructionForCost to VPRecipeBase::cost(), as it is now only used to decide which recipes to skip during cost computation and when to apply forced costs. PR: llvm#109708

…t(pcmpeq(and(X,Pow2),Pow2),B,A) Matches what we already do in LowerVSETCC to reuse an existing constant Fixes llvm#110875

…lvm#111600)

The `SymbolTableListTraits` template is explicitly instantiated for the following types:
* `llvm/lib/IR/Function.cpp`
  - `BasicBlock`
* `llvm/lib/IR/Module.cpp`
  - `Function`
  - `GlobalAlias`
  - `GlobalIFunc`
  - `GlobalVariable`

When LLVM is built on Windows with the `LLVM_EXPORT_SYMBOLS_FOR_PLUGINS` option enabled, the implicit instantiation of the template prevents the `SymbolTableListTraits` template from being exported. This causes link errors when the template or IR API is used in a plugin. This change prevents the template being implicitly instantiated for these types.

…1540) Currently this test is completely XFAILed as part of patch llvm#106077. However, the test works on the A and R profiles and fails only on the v7M profile, because it contains cases in which M-profile targets fail for atomic types greater than 4 bytes in size.

This was using addrspace 0 and 1 pointers interchangeably. This works out since they happen to use the same size, but we should consistently query or use the correct one.

…#111541 into a canonicalization (llvm#111614) This is a reasonable canonicalization because `extract` is more constrained than `extract_strided_slices`, so there is no loss of semantics here, just lifting an op to a special-case higher/constrained op. And the additional `shape_cast` is merely adding leading unit dims to match the original result type. Context: discussion on llvm#111541. I wasn't sure how this would turn out, but in the process of writing this PR, I discovered at least 2 bugs in the pattern introduced in llvm#111541, which shows the value of shared canonicalization patterns which are exercised on a high number of testcases. --------- Signed-off-by: Benoit Jacob <[email protected]>

With some restrictions, BIND(C) derived types can be converted to compatible BIND(C) derived types. Semantics already support this, but ConvertOp was missing the conversion of such types. Fixes llvm#107783

…#111666) Fixes llvm#111460.

These should be well behaved address computations.

…oops (llvm#111656) Properly handles `cycle` branching inside target distribute loops.

Same logic as other callsites, if the attributes are intersectable, we merge. Closes llvm#111713

…#111759) These were split in 0e8208e, with the only functional difference between them at the time being `--prepend_env PATH=%{lib-dir}` in the static config and `--prepend_env PATH=%{install-prefix}/bin` in the shared library config. However this difference is unnecessary - the static library config doesn't need any `--prepend_env` argument at all. Before 0e8208e, both configurations used the same config file, where the `--prepend_env` argument was unnecessary but benign in the static case. Reduce the unnecessary config duplication in this case, and return these configs to using one single config file for both setups.

…#111700)

FMINNM/FMAXNM instructions of AArch64 follow IEEE 754-2008, so we can use them to canonicalize a floating-point number. FMINNUM_IEEE/FMAXNUM_IEEE are also used when expanding FMINIMUMNUM/FMAXIMUMNUM, so let's define them. Update combine_andor_with_cmps.ll. Add fp-maximumnum-minimumnum.ll, with nnan testcases only. V1F64 is not supported yet: if we set v1f64 as legal, FMINNUM/FMAXNUM will have a problem, because both of them use `if (isOperationLegalOrCustom(FMAXNUM_IEEE, VT))`, and AArch64 depends on `expandFMINNUM_FMAXNUM` returning `SDValue()` for FMAXNUM and FMINNUM. This should be fixed in a future patch.

This finishes the clang implementation of P0522, getting rid of the fallback to the old, pre-P0522 rules.

Before this patch, when partial ordering template template parameters, we would perform, in order:
* If the old rules would match, we would accept it. Otherwise, don't generate diagnostics yet.
* If the new rules would match, just accept it. Otherwise, don't generate any diagnostics yet again.
* Apply the old rules again, this time with diagnostics.

This situation was far from ideal, as we would sometimes:
* Accept some things we shouldn't.
* Reject some things we shouldn't.
* Only diagnose rejection in terms of the old rules.

With this patch, we apply the P0522 rules throughout.

This needed to extend template argument deduction in order to accept the historical rule for TTP matching pack parameter to non-pack arguments. This change also makes us accept some combinations of historical and P0522 allowances we wouldn't before.

It also fixes a bunch of bugs that were documented in the test suite, for which I am not sure issues have already been created.

This causes a lot of changes to the way these failures are diagnosed, with related test suite churn.

The problem here is that the old rules were very simple and non-recursive, making it easy to provide customized diagnostics, and to keep them consistent with each other. The new rules are a lot more complex and rely on template argument deduction, substitutions, and they are recursive.

The approach taken here is to mostly rely on existing diagnostics, and create a new instantiation context that keeps track of this context. So for example when a substitution failure occurs, we use the error produced there unmodified, and just attach notes to it explaining that it occurred in the context of partial ordering this template argument against that template parameter.

This diverges from the old diagnostics, which would lead with an error pointing to the template argument, explain the problem in subsequent notes, and produce a final note pointing to the parameter.

Summary: Option `-fskip-odr-check-in-gmf` is set by default, and I think it is what most C++ developers want. But in header units, Clang's ODR checking is too strict, making them hard to use, as seen in the example in the diff. This diff relaxes ODR checks for unnamed modules to match GMF ODR checking. Test Plan: check-clang

…07350) With this change, we determine whether the primary template, and which partial specializations, would have participated in overload resolution prior to the P0522 changes. We collect those in an initial set. If this set is not empty, or the primary template would have matched, we proceed with this set as the candidates for overload resolution. Otherwise, we build a new overload set with everything else, and proceed as usual.

…te calls. (llvm#111457) Clang previously missed implementing P0522 pack matching for deduced function template calls. Fixes llvm#111363


Extra builders for CallIntrinsicOp. This is inspired by the comment from @antiagainst from [here](llvm#108933 (comment)).

…t attributes and undeclared templates (llvm#107786) Fixes llvm#107047 Fixes llvm#49093

…11679) This DAG combine replaces a floating-point load/store pair which has no other uses with an integer one, but did not copy the memory operand flags to the new instructions, resulting in it dropping the volatile flag. This optimisation is still valid if one or both of the instructions is volatile, so we can copy over the whole MachineMemOperand to generate volatile integer loads and stores where needed.

These might also be called with vectors, but we don't support that.

This does a global rename from `flang-new` to `flang`. I also removed/changed any TODOs that I found related to making this change. --------- Co-authored-by: H. Vetinari <[email protected]> Co-authored-by: Andrzej Warzynski <[email protected]>

llvm#111797) This commit fixes a bug in the import of nameless globals. Before this change, the fake symbol names were only generated during the transformation of the definition. This caused issues when the symbol was used before it was defined.

…e. (llvm#111428) Similar to 112aac4, this converts log libcalls to llvm.log.f64 intrinsics if we know they do not set errno, as the input is not zero and not negative. As log will produce errno if the input is 0 (returning -inf) or if the input is negative (returning nan), we also perform the conversion when we have noinf and nonan.
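To make the precondition concrete, here is a minimal C++ sketch of the input ranges described above. The helper names `logMaySetErrno` and `logKnownPositive` are hypothetical and are not part of LLVM; they only model which inputs can produce a pole or domain error, not how the pass proves the range.

```cpp
#include <cmath>

// Hypothetical helper (not LLVM API): returns true when log(x) may report
// an error through errno. log(0) is a pole error (result -inf) and log(x)
// for x < 0 is a domain error (result NaN). NaN inputs compare false here.
inline bool logMaySetErrno(double x) { return x <= 0.0; }

// Hypothetical wrapper: once the caller has established x > 0, there is no
// errno side effect to preserve, so a plain log is sufficient.
inline double logKnownPositive(double x) { return std::log(x); }
```

In this model, only inputs for which `logMaySetErrno` is false would be candidates for the intrinsic form.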

follow up work of llvm#106229, add create pass overload function to create pass. --------- Co-authored-by: jingzec <[email protected]>


- Add handling for unsigned integers to hlsl_elementwise_sign
- Use `select` instead of adding dx and spirv intrinsics for unsigned integers, as [discussed previously](llvm#101988 (comment))

fixes llvm#70078

### Related PRs
- llvm#101987
- llvm#101988
- llvm#101989

cc @farzonl @pow2clk @bob80905 @bogner @llvm-beanz
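As a rough illustration of why a `select` is enough for unsigned inputs, the scalar sketch below models the idea in plain C++. The function name `signOfUnsigned` is hypothetical, and the `int32_t` result type is an assumption made for the example; this is not the code generated by the patch.

```cpp
#include <cstdint>

// Hypothetical scalar model: an unsigned value is never negative, so its
// sign is 0 when the value is zero and 1 otherwise. This is a single
// compare followed by a select between two constants.
constexpr std::int32_t signOfUnsigned(std::uint32_t x) {
  return x == 0u ? 0 : 1;
}
```

Because only two outcomes are possible, no dedicated target intrinsic is needed for this case.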

Saves me searching for this every time someone asks.

…XTRACT_SUBVECTOR(V,C1+C2) (llvm#111685) Extract from the original source vector whenever possible. This removes a number of dependency bottlenecks and helps a number of shuffle combining cases: either by allowing us to avoid a cross-lane variable shuffle on a slow target by keeping the instruction count below the threshold, or on fast targets by making it easier to recognise that the subvectors all came from the same source.
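The index arithmetic behind extracting directly from the source vector can be sketched with ordinary `std::vector` slices. The helper `extractSubvector` is hypothetical and only models the element selection; it assumes the requested ranges are in bounds and says nothing about how the DAG combine itself is written.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of a subvector extract: copies `len` elements of `v`
// starting at index `start`. The caller must keep start + len <= v.size().
std::vector<int> extractSubvector(const std::vector<int>& v,
                                  std::size_t start, std::size_t len) {
  auto first = v.begin() + static_cast<std::ptrdiff_t>(start);
  return std::vector<int>(first, first + static_cast<std::ptrdiff_t>(len));
}
```

Extracting at offset `C2` from a slice that itself starts at `C1` selects the same elements as extracting at offset `C1 + C2` from the original vector, which is what lets the inner extract be bypassed.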

…m#111747) This module is used in various helper scripts since llvm#93712

…lvm#111720) Fixes missing m0 initialization for pre-gfx9 targets with local extending loads.

Implement the addMachineSSAOptimizations passes for AMDGPU. Porting the other generic passes in this category is WIP.

Run ArgumentPromotion before IPSCCP in the LTO pipeline, to expose more constants to be propagated. We also run PostOrderFunctionAttrs to improve the information available to ArgumentPromotion's alias analysis, and SROA to clean up allocas.

FMarno · 2024-10-17T15:46:37Z

mlir/lib/Conversion/GPUToLLVMSPV/GPUToLLVMSPV.cpp


+  static std::optional<uint32_t>
+  getIntelReqdSubGroupSize(FunctionOpInterface func) {
+    constexpr llvm::StringLiteral discardableIntelReqdSubgroupSize =


It would be good if we could get this from a function, e.g. an `IntelReqdSubgroupSizeAttrName` function.

FMarno · 2024-10-17T15:47:02Z

mlir/test/Conversion/GPUToLLVMSPV/gpu-to-llvm-spv.mlir

  // CHECK-SAME:               %[[I32_VAL:.*]]: i32, %[[I64_VAL:.*]]: i64,
  // CHECK-SAME:               %[[F16_VAL:.*]]: f16, %[[F32_VAL:.*]]: f32,
-  // CHECK-SAME:               %[[F64_VAL:.*]]: f64,  %[[OFFSET:.*]]: i32) {
+  // CHECK-SAME:               %[[F64_VAL:.*]]: f64,  %[[OFFSET:.*]]: i32)


Suggested change

// CHECK-SAME: %[[F64_VAL:.*]]: f64, %[[OFFSET:.*]]: i32)

// CHECK-SAME: %[[F64_VAL:.*]]: f64, %[[OFFSET:.*]]: i32) attributes {gpu.known_subgroup_size = 16 : i32} {

include the attribute in the check

Also use it for lowering in GPUToLLVMSPV

…_size In the GPU To LLVM SPV patterns

…en issue

Since llvm#109628 landed, this test has been failing on 32-bit Arm. This is due to a codegen problem (whether added or uncovered by the change, not known) where the trap instruction is placed after the frame pointer and link register are restored. llvm#113154

So the code was:
```
std::__1::vector<int>::operator[](unsigned int):
  sub sp, sp, #8
  str r0, [sp, #4]
  str r1, [sp]
  add sp, sp, #8
  .inst 0xe7ffdefe
  bx lr
```
When lldb saw the trap, the PC was inside operator[] but the frame information actually pointed to g.

This bug only happens for leaf functions so adding a return type works around it:
```
std::__1::vector<int>::operator[](unsigned int):
  push {r11, lr}
  mov r11, sp
  sub sp, sp, #8
  str r0, [sp, #4]
  str r1, [sp]
  mov sp, r11
  pop {r11, lr}
  .inst 0xe7ffdefe
  bx lr
```
(and operator[] should return T& anyway)

Now the PC location and frame information should match and the test passes.

alexfh and others added 30 commits October 9, 2024 14:15

[VPlan] Request lane 0 for pointer arg in PtrAdd.

01cbbc5

After 7f74651, the pointer operand of a PtrAdd may be replicated. Instead of requesting a single scalar, request lane 0, which correctly handles the case when there is a scalar per lane. Fixes llvm#111606.

[X86] Add isConstantPowerOf2 helper to replace repeated code. NFC.

25c3ecf

Prep work for llvm#110875

[X86] vselect-pcmp.ll - regenerate test checks with vpternlog comments

e17f701

[X86] Add test coverage for llvm#110875

4b4078a

[mlir] add missing CMake dependency on ShardingInterface generated he…

3b2bfb4

…aders for LinalgDialect (llvm#111603) This fixes non-deterministic build failures. Fixes llvm#111527 --------- Co-authored-by: zecheng.zhang <[email protected]> Co-authored-by: Mehdi Amini <[email protected]>

[APFloat] add predicates to fltSemantics for hasZero and hasSignedRepr (

3b7091b

llvm#111451) We add static methods to APFloatBase to allow the hasZero and hasSignedRepr properties of fltSemantics to be obtained.

AMDGPU: Regenerate test checks

890e481

[mlir][vector] Implement speculation for vector.transferx ops (llvm#1…

32db6fb

…11533) This patch implements speculation for vector.transfer_read/vector.transfer_write ops, allowing these ops to work with LICM.

[X86] combineSelect - Fold select(pcmpeq(and(X,Pow2),0),A,B) -> selec…

c47f3e8

…t(pcmpeq(and(X,Pow2),Pow2),B,A) Matches what we already do in LowerVSETCC to reuse an existing constant Fixes llvm#110875
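The equivalence behind this fold can be checked with a scalar model of a single lane. The sketch below is illustrative C++ only; the function names are hypothetical and none of this is the X86 backend code. It relies on the mask having exactly one bit set, so `x & pow2` is either `0` or `pow2`.

```cpp
#include <cstdint>

// Hypothetical helper: true when exactly one bit of v is set.
constexpr bool isPowerOf2(std::uint32_t v) {
  return v != 0 && (v & (v - 1)) == 0;
}

// Original form: compare the masked bit against zero.
constexpr std::uint32_t selectOnZero(std::uint32_t x, std::uint32_t pow2,
                                     std::uint32_t a, std::uint32_t b) {
  return ((x & pow2) == 0) ? a : b;
}

// Folded form: compare against the mask itself and swap the two arms.
// Equivalent to selectOnZero only when isPowerOf2(pow2) holds.
constexpr std::uint32_t selectOnMask(std::uint32_t x, std::uint32_t pow2,
                                     std::uint32_t a, std::uint32_t b) {
  return ((x & pow2) == pow2) ? b : a;
}
```

The folded form compares against a constant that the `and` already uses, which matches the commit's stated aim of reusing an existing constant.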

AMDGPU: Use pointer types more consistently (llvm#111651)

1e357cd

This was using addrspace 0 and 1 pointers interchangeably. This works out since they happen to use the same size, but we should consistently query or use the correct one.

[flang] Implement conversion of compatible derived types (llvm#111165)

390943f

With some restrictions, BIND(C) derived types can be converted to compatible BIND(C) derived types. Semantics already support this, but ConvertOp was missing the conversion of such types. Fixes llvm#107783

[clang][bytecode] Implement __builtin_ai32_addcarryx* (llvm#111671)

6f8e855

[Transform] Avoid repeated hash lookups (NFC) (llvm#111620)

7d9f993

[DSE] Simplify code with MapVector::operator[] (NFC) (llvm#111621)

48e4d67

[NVPTX] Avoid repeated map lookups (NFC) (llvm#111627)

bda4fc0

[Clang] Avoid a crash when parsing an invalid pseudo-destructor (llvm…

1ad5f31

…#111666) Fixes llvm#111460.

[clang-tidy] Avoid repeated hash lookups (NFC) (llvm#111628)

c911b0a

[Conversion] Avoid repeated hash lookups (NFC) (llvm#111637)

01a0e85

[bazel] port 8e2ccdc

f59b151

AMDGPU: Add instruction flags when lowering ctor/dtor (llvm#111652)

e85fcb7

These should be well behaved address computations.

[LLVM][AArch64] Enable SVEIntrinsicOpts at all optimisation levels.

6654578

[flang][OpenMP] Don't check unlabelled cycle branching for target l…

c4d288d

…oops (llvm#111656) Properly handles `cycle` branching inside target distribute loops.

goldsteinn and others added 24 commits October 10, 2024 01:07

[SimplifyCFG] Allow merging invoke's with different attrs

82ac399

Same logic as other callsites, if the attributes are intersectable, we merge. Closes llvm#111713

[bazel] port dc85d52

c15611a

[clang][bytecode] Diagnose class-specific operator delete calls (llvm…

f93258e

…#111700)

[clang] Implement TTP P0522 pack matching for deduced function templa…

4dadf42

…te calls. (llvm#111457) Clang previously missed implementing P0522 pack matching for deduced function template calls. Fixes llvm#111363

[mlir][llvmir] Added extra builders for CallInstrinsicOp (llvm#111664)

741ad3a

Extra builders for CallIntrinsicOp. This is inspired by the comment from @antiagainst from [here](llvm#108933 (comment)).

[Clang] prevent recovery call expression from proceeding with explici…

1fa3c85

…t attributes and undeclared templates (llvm#107786) Fixes llvm#107047 Fixes llvm#49093

[clang][bytecode] Check new builtins for integer types (llvm#111801)

f1eac77

These might also be called with vectors, but we don't support that.

[mlir] add overload createDIScopeForLLVMFuncOp function (llvm#111689)

d124b98

follow up work of llvm#106229, add create pass overload function to create pass. --------- Co-authored-by: jingzec <[email protected]>

[lldb][docs] Add link to RISC-V tracking issue in Platform Support

993de55

Saves me searching for this every time someone asks.

[lldb] Check for Python 'packaging' module at configuration time (llv…

7890919

…m#111747) This module is used in various helper scripts since llvm#93712

AMDGPU/GlobalISel: Insert m0 initialization before sextload/zextload (l…

c36f902

…lvm#111720) Fixes missing m0 initialization for pre-gfx9 targets with local extending loads.

[AMDGPU][NewPM] Fill out AMDGPU addMachineSSAOptimizations (llvm#111658)

039e6f8

Implement the addMachineSSAOptimizations passes for AMDGPU. Porting the other generic passes in this category is WIP.

FMarno commented Oct 17, 2024


FMarno mentioned this pull request Oct 21, 2024

Remove redundant attributes specifying the number of threads per warp intel/intel-xpu-backend-for-triton#1313

Closed

FMarno added 2 commits October 22, 2024 10:47

[mlir][gpu] Add known subgroup size

aca998e

Also use it for lowering in GPUToLLVMSPV

[mlir] Use intel_reqd_sub_group_size as backup for gpu.known_subgroup…

31fd327

…_size In the GPU To LLVM SPV patterns
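The fallback order described by these two commits can be sketched without any MLIR dependencies. In the sketch below, a plain `std::map` stands in for a function's attribute dictionary, and the helper names `lookupAttr` and `getSubgroupSize` are hypothetical. The key `gpu.known_subgroup_size` is taken from the test in this PR, while the exact spelling of the Intel attribute key is an assumption based on the commit title.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Stand-in for a function's attribute dictionary (assumption: the real
// code queries MLIR attributes on the function op instead).
using AttrMap = std::map<std::string, std::uint32_t>;

// Hypothetical helper: returns the attribute value if the key is present.
std::optional<std::uint32_t> lookupAttr(const AttrMap& attrs,
                                        const std::string& name) {
  auto it = attrs.find(name);
  if (it == attrs.end())
    return std::nullopt;
  return it->second;
}

// Prefer gpu.known_subgroup_size; fall back to the Intel-specific
// attribute (key spelling assumed) only when the first is absent.
std::optional<std::uint32_t> getSubgroupSize(const AttrMap& attrs) {
  if (auto known = lookupAttr(attrs, "gpu.known_subgroup_size"))
    return known;
  return lookupAttr(attrs, "intel_reqd_sub_group_size");
}
```

Returning `std::optional` keeps the "no size known" case explicit, so a caller can fall back to a generic lowering when neither attribute is present.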

FMarno force-pushed the gpu_known_subgroup_size branch from 65542dd to 31fd327 Compare October 22, 2024 09:48


76 participants
